Open-vocabulary speech indexing for voice and video mail retrieval
M. G. Brown, J. T. Foote, G. J. F. Jones, K. Sparck Jones, S. J. Young 

February 1997 Proceedings of the fourth ACM international conference on Multimedia 

Full text available: pdf(1.82 MB) Additional Information: full citation, references, citings, index terms



Keywords: audio indexing, browsing, content-based retrieval, information retrieval, speech 
recognition, word spotting 



Retrieving spoken documents by combining multiple index sources
G. J. F. Jones, J. T. Foote, K. Sparck Jones, S. J. Young 

August 1996 Proceedings of the 19th annual international ACM SIGIR conference on 
Research and development in information retrieval 

Full text available: pdf Additional Information: full citation, references, citings, index terms



Indexing handwriting using word matching

R. Manmatha, Chengfeng Han, E. M. Riseman, W. B. Croft 

April 1996 Proceedings of the first ACM international conference on Digital libraries 

Full text available: pdf(947.23 KB) Additional Information: full citation, references, citings, index terms



4 Automatic speech recognition for generalised time based media retrieval and indexing
John Robertson, Wai Yat Wong, Charles Chung, Dong Ki Kim 

September 1998 Proceedings of the sixth ACM international conference on Multimedia

Full text available: pdf(684.96 KB) Additional Information: full citation, references, index terms



Keywords: approximate string matching, information retrieval, multimedia, speech 
recognition 



A system for retrieving speech documents
Ulrike Glavitsch, Peter Schauble 

June 1992 Proceedings of the 15th annual international ACM SIGIR conference on
Research and development in information retrieval

Full text available: pdf(941.62 KB) Additional Information: full citation, abstract, references, citings, index terms

An information retrieval model is presented for the retrieval of speech documents, i.e. audio
recordings containing speech. The indexing vocabulary consists of indexing features that
have the following characteristics. First, they are easy to recognize by speech recognition
methods. Second, the number of different indexing features is small such that a reasonable
amount of training data is sufficient to train the hidden Markov models that are used by the
speech recognition process. Third, th ... 

Document expansion for speech retrieval
Amit Singhal, Fernando Pereira 

August 1999 Proceedings of the 22nd annual international ACM SIGIR conference on 
Research and development in information retrieval 

Full text available: pdf(253.45 KB) Additional Information: full citation, references, citings, index terms



Cross-language speech retrieval: establishing a baseline performance
Paraic Sheridan, Martin Wechsler, Peter Schauble 

July 1997 ACM SIGIR Forum, Proceedings of the 20th annual international ACM
SIGIR conference on Research and development in information retrieval,
Volume 31 Issue SI

Full text available: pdf(1.82 MB) Additional Information: full citation, references, citings, index terms



Detecting topical events in digital video
Tanveer Syeda-Mahmood, S. Srinivasan 

October 2000 Proceedings of the eighth ACM international conference on Multimedia 

Full text available: pdf(1.04 MB) Additional Information: full citation, abstract, references, citings, index terms

The detection of events is essential to high-level semantic querying of video databases. It is 
also a very challenging problem requiring the detection and integration of evidence for an
event available in multiple information modalities, such as audio, video and language. This 
paper focuses on the detection of specific types of events, namely, topic of discussion events 
that occur in classroom/lecture environments. Specifically, we present a query-driven 
approach to the detection of topic of ... 

Keywords: multi-modal fusion, query-driven topic detection, slide detection, topic of 
discussion events, topical audio events 



Vision: a digital video library

Wei Li, Susan Gauch, John Gauch, Kok Meng Pua 

April 1996 Proceedings of the first ACM international conference on Digital libraries 

Full text available: pdf(1.43 MB) Additional Information: full citation, references, citings, index terms



Keywords: content-based indexing and retrieving, digital libraries, video and audio 
processing 



10 Towards robust features for classifying audio in the CueVideo system
Savitha Srinivasan, Dragutin Petkovic, Dulce Ponceleon 

October 1999 Proceedings of the seventh ACM international conference on Multimedia 
(Part 1) 

Full text available: pdf(867.70 KB) Additional Information: full citation, abstract, references, citings, index terms

The role of audio in the context of multimedia applications involving video is becoming 
increasingly important. Many efforts in this area focus on audio data that contains some 
built-in semantic information structure such as in broadcast news, or focus on classification 
of audio that contains a single type of sound such as clear speech or clear music only. In
the CueVideo system, we detect and classify audio that consists of mixed audio, i.e. 
combinations of speech and mus ... 

Keywords: audio segmentation and classification, speech/music discrimination



Phonetic confusion matrix based spoken document retrieval
Savitha Srinivasan, Dragutin Petkovic 

July 2000 Proceedings of the 23rd annual international ACM SIGIR conference on
Research and development in information retrieval 

Full text available: pdf(714.16 KB) Additional Information: full citation, abstract, references, citings, index terms

Combined word-based index and phonetic indexes have been used to improve the
performance of spoken document retrieval systems primarily by addressing the out-of- 
vocabulary retrieval problem. However, a known problem with phonetic recognition is its 
limited accuracy in comparison with word level recognition. We propose a novel method for 
phonetic retrieval in the CueVideo system based on the probabilistic formulation of term 
weighting using phone confusion data in a Bayesian framework. We eval ... 

FILOCHAT: handwritten notes provide access to recorded conversations 
Steve Whittaker, Patrick Hyland, Myrtle Wiley

April 1994 Proceedings of the SIGCHI conference on Human factors in computing 
systems: celebrating interdependence 

Full text available: pdf(848.13 KB) Additional Information: full citation, references, citings, index terms



Keywords: audio, handwriting, indexing, notes, retrieval, speech-as-data 



Retrieval effectiveness of an ontology-based model for information selection
Latifur Khan, Dennis McLeod, Eduard Hovy 

January 2004 The VLDB Journal — The International Journal on Very Large Data Bases, 

Volume 13 Issue 1 

Full text available: pdf(278.74 KB) Additional Information: full citation, abstract, index terms

Technology in the field of digital media generates huge amounts of nontextual information,
audio, video, and images, along with more familiar textual information. The potential for 
exchange and retrieval of information is vast and daunting. The key problem in achieving 
efficient and user-friendly retrieval is the development of a search mechanism to guarantee 
delivery of minimal irrelevant information (high precision) while insuring relevant 
information is not overlooked (high recall). The tradi ... 

Keywords: Audio, Metadata, Ontology, Precision, Recall, SQL 



Information Retrieval and Text Mining: Advances in phonetic word spotting
Arnon Amir, Alon Efrat, Savitha Srinivasan 

October 2001 Proceedings of the tenth international conference on Information and 
knowledge management 

Full text available: pdf(561.11 KB) Additional Information: full citation, abstract, references, index terms

Phonetic speech retrieval is used to augment word based retrieval in spoken document
retrieval systems, for in and out of vocabulary words. In this paper, we present a new 
indexing and ranking scheme using metaphones and a Bayesian phonetic edit distance. We
conduct an extensive set of experiments using a hundred hours of HUB4 data with ground 
truth transcript and twenty-four thousands query words. We show improvement of up to 
15% in precision compare to results obtained speech recognition alone ... 

New techniques for open-vocabulary spoken document retrieval
Martin Wechsler, Eugen Munteanu, Peter Schauble 

August 1998 Proceedings of the 21st annual international ACM SIGIR conference on 
Research and development in information retrieval 

Full text available: pdf(204.51 KB) Additional Information: full citation, references, citings, index terms



NewsComm: a hand-held interface for interactive access to structured audio 
Deb K. Roy, Chris Schmandt 

April 1996 Proceedings of the SIGCHI conference on Human factors in computing 
systems: common ground 

Full text available: pdf, html(36.39 KB) Additional Information: full citation, references, citings, index terms



Keywords: audio interfaces, hand-held computers, structured audio 



17 Retrieving and visualizing video
Boon-Lock Yeo, Minerva M. Yeung 

December 1997 Communications of the ACM, volume 40 issue 12 

Full text available: pdf(2.01 MB) Additional Information: full citation, references, citings, index terms



Video abstracting

Rainer Lienhart, Silvia Pfeiffer, Wolfgang Effelsberg

December 1997 Communications of the ACM, volume 40 issue 12

Full text available: pdf(2.51 MB) Additional Information: full citation, references, citings, index terms



SpeechSkimmer: a system for interactively skimming recorded speech
Barry Arons 

March 1997 ACM Transactions on Computer-Human Interaction (TOCHI), volume 4 issue 1 

Full text available: pdf(1.03 MB) Additional Information: full citation, abstract, references, citings, index terms, review

Listening to a speech recording is much more difficult than visually scanning a document 
because of the transient and temporal nature of audio. Audio recordings capture the 
richness of speech, yet it is difficult to directly browse the stored information. This article 
describes techniques for structuring, filtering, and presenting recorded speech, allowing a 
user to navigate and interactively find information in the audio domain. This article describes 
the SpeechSkimmer system for interacti ... 

Keywords: audio browsing, interactive listening, nonspeech audio, speech as data, speech 
skimming, speech user interfaces, time compression 



Speech, Audio, Gesture: SCANMail: a voice mail interface that makes speech browsable, readable and searchable

Steve Whittaker, Julia Hirschberg, Brian Amento, Litza Stark, Michiel Bacchiani, Philip
Isenhour, Larry Stead, Gary Zamchick, Aaron Rosenberg

April 2002 Proceedings of the SIGCHI conference on Human factors in computing 
systems: Changing our world, changing ourselves 

Full text available: pdf(540.75 KB) Additional Information: full citation, abstract, references, citings, index terms

Increasing amounts of public, corporate, and private speech data are now available on-line. 
These are limited in their usefulness, however, by the lack of tools to permit their browsing 
and search. The goal of our research is to provide tools to overcome the inherent difficulties 
of speech access, by supporting visual scanning, search, and information extraction. We 
describe a novel principle for the design of UIs to speech data: What You See Is Almost 
What You Hear (WYSIAWYH). In WYSI ...

Keywords: "speech as data", asynchronous communication, empirical evaluation, speech
access, voicemail, what you see is almost what you hear

Key to effective video retrieval: effective cataloging and browsing

September 1998 Proceedings of the sixth ACM international conference on Multimedia

Full text available: pdf(1.03 MB) Additional Information: full citation, references, citings, index terms



Keywords: cataloger, digital library creation, multiview storyboard, speech recognition,
video annotation, video search and browse, video segmentation

October 1993 ACM Transactions on Information Systems (TOIS), volume 11 issue 4

Full text available: pdf(1.78 MB) Additional Information: full citation, abstract, references, citings, index terms



Although talking is an integral part of collaboration, there has been little computer support 
for acquiring and accessing the contents of conversations. Our approach has focused on 
ubiquitous audio, or the unobtrusive capture of speech interactions in everyday work 
environments. Speech recognition technology cannot yet transcribe fluent conversational 
speech, so the words themselves are not available for organizing the captured interactions. 
Instead, the structure of an int ... 



Keywords: audio interactions, collaborative work, multimedia workstation software,
semi-structured data, software telephony, stored speech, ubiquitous computing



23 Semantic speech editing 
Steve Whittaker, Brian Amento 

April 2004 Proceedings of the 2004 conference on Human factors in computing 
systems 

Full text available: pdf(532.22 KB) Additional Information: full citation, abstract, references

Editing speech data is currently time-consuming and error-prone. Speech editors rely on 
acoustic waveform representations, which force users to repeatedly sample the underlying 
speech to identify words and phrases to edit. Instead we developed a semantic editor that 
reduces the need for extensive sampling by providing access to meaning. The editor shows 
a time-aligned errorful transcript produced by applying automatic speech recognition (ASR) 
to the original speech. Users visually scan the words ... 

Keywords: acoustic representations, speech browsing, speech editing, speech recognition

A multimodal learning interface for grounding spoken language in sensory perceptions
Chen Yu, Dana H. Ballard 

July 2004 ACM Transactions on Applied Perception (TAP), volume 1 issue 1

Full text available: pdf(1.73 MB) Additional Information: full citation, abstract, references, index terms

We present a multimodal interface that learns words from natural interactions with users. In 
light of studies of human language development, the learning system is trained in an 
unsupervised mode in which users perform everyday tasks while providing natural language 
descriptions of their behaviors. The system collects acoustic signals in concert with user- 
centric multisensory information from nonspeech modalities, such as user's perspective 
video, gaze positions, head directions, and hand moveme ... 

Keywords: Multimodal learning, cognitive modeling, multimodal interaction



25 Speech and gaze: A multimodal learning interface for grounding spoken language in sensory perceptions
Chen Yu, Dana H. Ballard 

November 2003 Proceedings of the 5th international conference on Multimodal
interfaces

Full text available: pdf(849.56 KB) Additional Information: full citation, abstract, references, index terms

Most speech interfaces are based on natural language processing techniques that use pre- 
defined symbolic representations of word meanings and process only linguistic information. 
To understand and use language like their human counterparts in multimodal human- 
computer interaction, computers need to acquire spoken language and map it to other 
sensory perceptions. This paper presents a multimodal interface that learns to associate 
spoken language with perceptual features by being situated in users ... 

Keywords: language acquisition, machine learning, multimodal integration 



Associating cooking video with related textbook
Reiko Hamada, Ichiro Ide, Shuichi Sakai 

November 2000 Proceedings of the 2000 ACM workshops on Multimedia

Full text available: pdf(1.03 MB) Additional Information: full citation, abstract, references, index terms

We have been handling video with supplementary documents, such as cooking programs, 
and are working on integration of such media. Through the integration, many applications 
will become possible, for example, reconstruction of multimedia data that supplement the 
information of each medium, construction of interactive database, or kitchen automation. 
Until now, we have proposed an integration system that perform integrative analysis of 
image, audio and text and associate each other. In this pap ... 

Distributed design review in virtual environments 

Mike Daily, Mike Howard, Jason Jerald, Craig Lee, Kevin Martin, Doug McInnes, Pete Tinker
September 2000 Proceedings of the third international conference on Collaborative 
virtual environments 

Full text available: pdf(1.25 MB) Additional Information: full citation, abstract, references, citings, index terms

In large distributed corporations, distributed design review offers the potential for cost 
savings, reduced time to market, and improved efficiency. It also has the potential to 
improve the design process by enabling wider expertise to be incorporated in design 
reviews. This paper describes the integration of several components to enable distributed 
virtual design review in mixed multi-party, heterogeneous multi-site 2D and immersive 3D 
environments. The system provides higher layers of sup ...

Keywords: design review, global scale collaboration, multi-modal, spatialized audio, speech
recognition, tele-conferencing, virtual environments 



Associating video with related documents 

Reiko Hamada, Ichiro Ide, Shuichi Sakai, Hidehiko Tanaka 

October 1999 Proceedings of the seventh ACM international conference on Multimedia 
(Part 2) 

Full text available: pdf(647.34 KB) Additional Information: full citation, references, citings, index terms



29 Document image understanding: Modeling content identification from document images

Takehiro Nakayama 

October 1994 Proceedings of the fourth conference on Applied natural language 
processing 

Full text available: pdf(555.79 KB), Publisher Site Additional Information: full citation, abstract, references

A new technique to locate content-representing words for a given document image using
abstract representation of character shapes is described. A character shape code 
representation defined by the location of a character in a text line has been developed. 
Character shape code generation avoids the computational expense of conventional optical
character recognition (OCR). Because character shape codes are an abstraction of standard 
character code (e.g., ASCII), the mapping is ambiguous. In this p ... 



Speech-as-data technologies for personal information devices
Roger C. F. Tucker, Marianne Hickey, Nick Haddock 
May 2003 Personal and Ubiquitous Computing, volume 7 issue 1

Full text available: pdf(312.92 KB) Additional Information: full citation, abstract, index terms

For small, portable devices, speech input has the advantages of low-cost and small 
hardware, can be used on the move or whilst the eyes & hands are busy, and is natural and 
quick. Rather than rely on imperfect speech recognition we propose that information entered 
as speech is kept as speech and suitable tools are provided to allow quick and easy access 
to the speech-as-data records. This paper summarises our work on the technologies needed 
for these tools - for organising, browsing, ... 

Keywords: Audio summarisation, Speech compression, Speech recognition, Speech-as-data,
Wordspotting

January 1995 Proceedings of the third ACM international conference on Multimedia

Keywords: ATM, atm, browsing, content-based retrieval, information retrieval, multimedia,
television news, text subtitles

This paper describes the ACM Multimedia '94 Conference Workshop on Multimedia Database
Management Systems held on 21 October 1994 in San Francisco, California. The workshop 
consisted of four sessions: designing multimedia database management systems, video and 
continuous media service, multimedia storage and retrieval management, and miscellaneous 
topics in multimedia data management. The workshop concluded with a discussion session 
on directions for multimedia database management. Twenty ... 

33 SpeechSkimmer: interactively skimming recorded speech
Barry Arons 

December 1993 Proceedings of the 6th annual ACM symposium on User interface 
software and technology 

Full text available: pdf(1.13 MB) Additional Information: full citation, references, citings, index terms

Keywords: browsing, interactive listening, non-speech audio, speech as data, speech
detection, speech skimming, speech user interfaces, time compression 



34 Spoken dialogue technology: enabling the conversational user interface
March 2002 ACM Computing Surveys (CSUR), volume 34 issue 1

Full text available: pdf(987.69 KB) Additional Information: full citation, abstract, references, citings, index terms

Spoken dialogue systems allow users to interact with computer-based applications such as
databases and expert systems by using natural spoken language. The origins of spoken
dialogue systems can be traced back to Artificial Intelligence research in the 1950s
concerned with developing conversational interfaces. However, it is only within the last 
decade or so, with major advances in speech technology, that large-scale working systems 
have been developed and, in some cases, introduced into commerc ... 

Keywords: Dialogue management, human computer interaction, language generation,
language understanding, speech recognition, speech synthesis

35 String Match and Text Extraction: Improved string matching under noisy channel conditions

Kevyn Collins-Thompson, Charles Schweizer, Susan Dumais

October 2001 Proceedings of the tenth international conference on Information and 
knowledge management

Full text available: pdf(1.71 MB) Additional Information: full citation, abstract, references, index terms

Many document-based applications, including popular Web browsers, email viewers, and
word processors, have a 'Find on this Page' feature that allows a user to find every 
occurrence of a given string in the document. If the document text being searched is derived 
from a noisy process such as optical character recognition (OCR), the effectiveness of typical 
string matching can be greatly reduced. This paper describes an enhanced string-matching 
algorithm for degraded text that improves recall, whi ... 

Keywords: approximate string matching, information retrieval evaluation, noisy channel 
model, optical character recognition